Red Wine quality by Alicia Escontrela

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

Univariate Plots Section

Most of the wines analysed in this dataset are of medium quality. To visualize this variable in context, I’ll create a categorical variable named quality_factor using these ranges: 0-4 = Low, 5-6 = Medium, 7-9 = High.
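A minimal sketch of how such a variable could be built with base R's cut(), assuming the ranges stated above. The example vector only holds the quality scores that appear in the summary output (3 to 9); it is not the real column.

```r
# Quality scores observed in the summary output above (3 to 9).
quality <- c(3L, 4L, 5L, 6L, 7L, 8L, 9L)

# Right-closed intervals: (-Inf, 4] -> Low, (4, 6] -> Medium, (6, Inf) -> High.
quality_factor <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("Low", "Medium", "High"),
  ordered_result = TRUE
)

# Count how many scores fall in each category.
table(quality_factor)
```

Making the factor ordered keeps Low, Medium and High in a sensible order on plot axes and in tables.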

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "quality_factor"

The plots above show that fixed acidity has a positively skewed distribution. It would be interesting to compare this variable with the other types of acidity, as we will see below.

The plot above shows that volatile acidity has a positively skewed distribution with many outliers, as the boxplot shows.

The citric acid distribution plots show a distribution closer to normal than those of fixed and volatile acidity, again with many outliers. It is also worth noting that the x limits are similar for volatile acidity and citric acid (roughly 0-1), whereas fixed acidity is spread over a wider range.

Free sulfur dioxide has a positively skewed distribution with many outliers, as the boxplot shows.

Total sulfur dioxide has a positively skewed distribution, with fewer outliers than free sulfur dioxide.
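To make "outlier" concrete, here is a small sketch that computes the Tukey fences from the quartiles printed in the summary above. It assumes the boxplots use the usual 1.5 x IQR rule (the default in both base R and ggplot2); tukey_fences is a hypothetical helper name, not a function from the original analysis.

```r
# Tukey fences: values below Q1 - k*IQR or above Q3 + k*IQR are drawn as outliers.
tukey_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles taken from the summary output above.
free_so2  <- tukey_fences(q1 = 23,  q3 = 46)
total_so2 <- tukey_fences(q1 = 108, q3 = 167)

free_so2
total_so2
```

Comparing these fences with the reported maxima (289 for free and 440 for total sulfur dioxide) shows how far the upper tail extends past the whiskers in both cases.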

Since free sulfur dioxide prevents microbial growth and oxidation of wine, and total sulfur dioxide becomes evident in the nose and taste of the wine, as mentioned in the description of the variables, it would be interesting to evaluate their impact on the quality rating.

The plots above show that sulphates have a positively skewed distribution, with a strong presence of outliers.

Alcohol (% by volume) shows a positively skewed distribution that is more widespread than that of the other variables.

The plots above show that sulphates and alcohol have positively skewed distributions, while pH has a distribution closer to normal. Also, as stated in the pH description, the pH of most wines in the dataset is between 3 and 4.

The plots above show that residual sugar has a positively skewed distribution with almost no outliers. Wines are most concentrated within the first quartile of residual sugar, and almost none exceed 20 g/dm^3.

The plots above show that most wines in the dataset have a density between 0.99 and 1. We did not find it useful to plot a histogram for this variable because we could not find a good enough resolution.
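One possible way to pick a histogram resolution for such a narrow variable is the Freedman-Diaconis rule, which derives a bin width from the interquartile range and the sample size. This is offered only as a hedged suggestion, not as the method used in this analysis; the inputs are the density quartiles, range and row count printed above.

```r
# Inputs taken from the str() and summary() output above.
n  <- 4898      # number of observations
q1 <- 0.9917    # density, 1st quartile
q3 <- 0.9961    # density, 3rd quartile

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3).
bin_width <- 2 * (q3 - q1) / n^(1 / 3)

# Number of bins needed to cover the observed range (min 0.9871, max 1.0390).
n_bins <- ceiling((1.0390 - 0.9871) / bin_width)

bin_width
n_bins
```

The resulting width is roughly 0.0005, which could be passed as the binwidth of a histogram if one is attempted again.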

Univariate Analysis

What is the structure of your dataset?

There are 4898 observations of 13 variables: an index column (X) and 12 wine attributes (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality). Eleven of these attributes describe chemical properties, and one describes overall quality (0 as bad quality and 10 as high quality).

Most of the variables show a positively skewed distribution. However, pH and density show distributions closer to normal.

In this first exploratory analysis, we can see that most of the wines evaluated were rated as medium quality, with densities very close to each other (the difference between the min and max values is about 0.052), pH mostly from 3 to 4, and residual sugar mostly below 20 g/dm^3 with a mean value of 6.391 g/dm^3.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the dataset is quality. We are interested in evaluating how the chemical properties affect the quality rating.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

We think chemical properties such as acidity, pH and residual sugar will have an impact on the quality of the wine, because these variables affect its taste.

Did you create any new variables from existing variables in the dataset?

Yes, we created one categorical variable for quality (quality_factor), because quality is easier to visualize when it is given a categorical context.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

No, we only created one new categorical variable for quality.

Bivariate Plots Section

I’m interested in evaluating the relationship between quality and the chemical properties. To simplify the analysis, I’ve created a function that draws a boxplot comparing quality with another variable.
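The original helper is not shown, so the following is only a sketch of what such a function might look like, written with base graphics rather than whatever plotting library was actually used. The name plot_by_quality and the toy rows are hypothetical.

```r
# Hypothetical helper: boxplot of one variable grouped by quality_factor.
plot_by_quality <- function(data, variable) {
  boxplot(
    reformulate("quality_factor", response = variable),  # variable ~ quality_factor
    data = data,
    xlab = "quality_factor",
    ylab = variable
  )
}

# Toy rows for illustration only, not taken from the dataset.
toy <- data.frame(
  alcohol = c(9.0, 9.4, 10.1, 10.6, 11.8, 12.3),
  quality_factor = factor(
    c("Low", "Low", "Medium", "Medium", "High", "High"),
    levels = c("Low", "Medium", "High")
  )
)

# boxplot() draws the plot and invisibly returns the per-group statistics.
stats <- plot_by_quality(toy, "alcohol")
```

Passing the column name as a string lets the same helper be reused for every chemical property, for example plot_by_quality(wine, "volatile.acidity"), assuming the data frame is called wine.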

We start the bivariate analysis by measuring the correlation between variables using ggpairs and ggcorr, as shown below.

The first correlation analysis shows that the most strongly correlated pairs are density/alcohol and density/residual sugar:
density vs alcohol (-0.8)
density vs residual.sugar (0.8)

The following pairs show correlations of about 0.4 to 0.6 in absolute value:
total.sulfur.dioxide vs free.sulfur.dioxide (0.6)
total.sulfur.dioxide vs density (0.5)
alcohol vs quality (0.4)
alcohol vs residual sugar (-0.5)

Having found the most strongly correlated variables in this first bivariate analysis, we subset the dataset to these variables and apply ggpairs again.
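As a rough illustration of the subsetting step, the sketch below computes a Pearson correlation matrix with base R's cor() on three of the selected columns. It uses only the first five rows as printed (rounded) in the str() output, so the coefficients will not match the full-data values quoted above; only the signs are expected to agree.

```r
# First five rows as printed in the str() output above (density is rounded there).
wine_head <- data.frame(
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5)
)

# Pairwise Pearson correlations between the selected columns.
cor_matrix <- cor(wine_head, method = "pearson")

round(cor_matrix, 2)
```

On the full data frame the same call could be applied to the subset of columns of interest before handing it to a pair-plot function.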

Using the function created above, we analyse each variable in relation to quality.

It is important to notice that there are many outliers for volatile acidity and citric acid. For citric.acid, the graph above shows that values are similar across the whole range of qualities. However, for low quality wines the difference between the first and third quartiles is bigger than for high quality wines. The volatile acidity boxplot shows that low quality wines have higher levels of this variable, which can lead to a vinegar taste. The fixed acidity boxplot shows a similar behaviour. We should remove the outliers for a deeper analysis.

The graph above shows the fixed acidity boxplot grouped by quality_factor, after removing the outliers shown before.
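The text does not say how the outliers were removed, so this is a sketch under one common assumption: rows outside the 1.5 x IQR fences of the whole column are dropped. The helper name remove_outliers and the toy values are illustrative only.

```r
# Hypothetical helper: keep rows whose value lies within the Tukey fences.
remove_outliers <- function(data, variable, k = 1.5) {
  x <- data[[variable]]
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[[2]] - q[[1]])
  keep <- !is.na(x) & x >= q[[1]] - fence & x <= q[[2]] + fence
  data[keep, , drop = FALSE]
}

# Toy values for illustration; 14.2 is the maximum reported in the summary.
toy <- data.frame(fixed.acidity = c(6.2, 6.3, 6.8, 7.0, 7.2, 7.3, 8.1, 14.2))

cleaned <- remove_outliers(toy, "fixed.acidity")
cleaned$fixed.acidity
```

An alternative design would compute the fences separately within each quality_factor group, which matches what a grouped boxplot displays more closely.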

The pH graph shows that wines in the dataset have similar pH values, mostly between 3 and 4, with similar behaviour across quality levels. For residual sugar, the graph shows that medium quality wines have a higher concentration of residual sugar than high quality wines.

The graphs above show that density and alcohol are inversely related: as density increases, the concentration of alcohol decreases. For instance, high quality wines have lower density but a higher percentage of alcohol, while low quality wines have higher density but a lower percentage of alcohol.

The plot above shows that density behaves very similarly for low and medium quality wines, while density for high quality wines tends to be lower.

The boxplot above shows a higher concentration of SO2 for medium quality wines.

To evaluate how acid and residual sugar affect the quality rating, we used a scale created by the International Riesling Foundation (http://drinkriesling.com/tasteprofile/thescale) to categorise the taste of wine. As mentioned on their webpage, these categories are intended to be used by wineries.

The IRF scale is based on the sugar-to-acid ratio, which is then corrected by pH. We create a new variable called total acid that sums all types of acid (wine$fixed.acidity + wine$volatile.acidity + wine$citric.acid) and then use this variable to calculate the sugar-to-acid ratio.

## [1] 7.63 6.94 8.78 7.75 7.75 8.78
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06459 0.23495 0.72251 0.85776 1.28738 7.02616
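The two derived variables can be sketched as follows, using the first six rows printed in the str() output at the top of the report. The column name total_acid is an assumption (the text only calls it "total acid"), and the ratio is assumed to be residual sugar divided by total acid.

```r
# First six rows as printed in the str() output above.
wine_head <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1, 7.2, 7.2, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.23, 0.23, 0.28),
  citric.acid      = c(0.36, 0.34, 0.40, 0.32, 0.32, 0.40),
  residual.sugar   = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9)
)

# Total acid: sum of the three acidity measurements.
wine_head$total_acid <- with(
  wine_head,
  fixed.acidity + volatile.acidity + citric.acid
)

# Sugar-to-acid ratio used as the basis of the IRF scale.
wine_head$sugar_acid_ratio <- with(wine_head, residual.sugar / total_acid)

wine_head$total_acid
```

The total_acid values reproduce the first output line shown above (7.63 6.94 8.78 7.75 7.75 8.78), which supports reading that line as head() of the summed acidity.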

Then we create a new categorical variable, ratio_scale, based on the sugar_acid_ratio variable and the IRF scale.

##          Dry   Medium Dry Medium Sweet        Sweet 
##         3054         1515          328            1
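A hedged sketch of how the categories could be assigned with cut(). The cut points (1, 2 and 4) are my reading of the IRF sugar-to-acid thresholds and are an assumption here; they should be checked against the IRF page cited above. The example ratios are illustrative values spanning the range in the summary.

```r
# Illustrative sugar-to-acid ratios covering the range reported above.
sugar_acid_ratio <- c(0.23, 0.79, 1.10, 2.71, 7.03)

# ASSUMED thresholds: <= 1 Dry, (1, 2] Medium Dry, (2, 4] Medium Sweet, > 4 Sweet.
ratio_scale <- cut(
  sugar_acid_ratio,
  breaks = c(-Inf, 1, 2, 4, Inf),
  labels = c("Dry", "Medium Dry", "Medium Sweet", "Sweet"),
  ordered_result = TRUE
)

table(ratio_scale)
```

With the reported maximum ratio of about 7.03 and a third quartile of about 1.29, these thresholds are at least plausible for a table with a single Sweet wine and a majority of Dry wines.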

Then we created a function that uses the variables ratio_scale and pH to compute the IRF scale, taking into account how pH modifies the ratio_scale variable created above.
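The original function is not shown. The sketch below is a simplified stand-in that shifts a wine one category sweeter when pH is high and one category drier when pH is low. The function name adjust_for_ph and the pH thresholds (3.3 and 2.9) are assumptions for illustration, and the real IRF rules may differ per category, so they should be verified against the IRF page.

```r
irf_levels <- c("Dry", "Medium Dry", "Medium Sweet", "Sweet")

# Hypothetical helper: move one step sweeter for high pH, one step drier for low pH.
# The thresholds are ASSUMED, not taken from the original analysis.
adjust_for_ph <- function(ratio_scale, pH, high = 3.3, low = 2.9) {
  idx <- match(as.character(ratio_scale), irf_levels)
  shift <- ifelse(pH >= high, 1L, ifelse(pH <= low, -1L, 0L))
  # Clamp so that Dry cannot get drier and Sweet cannot get sweeter.
  idx <- pmin(pmax(idx + shift, 1L), length(irf_levels))
  factor(irf_levels[idx], levels = irf_levels, ordered = TRUE)
}

adjusted <- adjust_for_ph(
  ratio_scale = c("Dry", "Medium Dry", "Medium Sweet", "Sweet", "Dry"),
  pH          = c(3.30, 2.85, 3.19, 3.40, 2.80)
)

adjusted
```

Working on level indices keeps the adjustment vectorised, so the same call can be applied to whole columns of a data frame.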

The plots above show how the wine categories change once the impact of pH on taste is taken into account.

The graph above shows that density is lower for higher percentages of alcohol. It shows a linear pattern for one of the strongest correlations found with ggpairs.

Alcohol is the variable most strongly correlated with quality. The graph above shows that the percentage of alcohol is higher for high quality wines, while the percentages for medium and low quality wines are very similar.

The relationship between residual sugar and density shows that wines with higher density tend to have a higher concentration of residual sugar, in line with the positive correlation (0.8) found earlier. However, the y limits are narrow: as seen before, density ranges only from 0.9871 to 1.0390.

The plot above shows that wines with a higher concentration of total sulfur dioxide tend to have higher density.

The relationship between free sulfur dioxide and total sulfur dioxide appears to follow a log10 pattern, with total sulfur dioxide increasing as the concentration of free sulfur dioxide increases.

The relationship between alcohol and residual sugar is not very strong. The graph shows a tendency for the concentration of alcohol to decrease in wines with more residual sugar, consistent with the negative correlation (-0.5) found earlier. It would be interesting to evaluate these variables together with quality and the IRF scale in the multivariate analysis.

The plot above shows a higher concentration of alcohol for dry wines than for medium sweet and sweet wines. It’s important to notice that the alcohol concentration behaves similarly for dry and medium dry wines.

This graph shows that pH tends to decrease as fixed acidity increases, which makes sense because lower pH values are more acidic. However, the correlation is not very strong, which is also visible in the graph.

The graph above shows a higher concentration of sulphates for medium quality wines.

We’re going to evaluate the relationship between quality and alcohol based on conditional means, as seen below.

## # A tibble: 7 x 4
##   quality  al_mean al_median     n
##     <int>    <dbl>     <dbl> <int>
## 1       3 10.34500     10.45    20
## 2       4 10.15245     10.10   163
## 3       5  9.80884      9.50  1457
## 4       6 10.57537     10.50  2198
## 5       7 11.36794     11.40   880
## 6       8 11.63600     12.00   175
## 7       9 12.18000     12.50     5
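A table like the one above can be produced by grouping on quality and summarising alcohol. The tibble header suggests dplyr was used; the sketch below is a base R equivalent under that assumption, run on toy rows rather than the real data.

```r
# Toy rows for illustration only; the real data frame has 4898 rows.
toy <- data.frame(
  quality = c(5L, 5L, 6L, 6L, 6L, 7L),
  alcohol = c(9.4, 9.6, 10.2, 10.5, 10.8, 11.4)
)

# Split by quality, summarise each group, and stack the results.
by_quality <- do.call(rbind, lapply(split(toy, toy$quality), function(g) {
  data.frame(
    quality   = g$quality[1],
    al_mean   = mean(g$alcohol),
    al_median = median(g$alcohol),
    n         = nrow(g)
  )
}))

by_quality
```

The column names al_mean, al_median and n mirror the ones in the printed table so the result can be plotted the same way.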

The plot above shows that the mean concentration of alcohol is higher for high quality wines. Wines rated 5 have the lowest concentration of alcohol, lower even than the low quality wines (rated 3 and 4).
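As a quick consistency check on the conditional means, the group means weighted by their counts should recover the overall alcohol mean reported in the summary (10.51), and the counts should add up to the number of observations. The values below are copied from the table above.

```r
# Conditional means and group sizes copied from the table above (quality 3 to 9).
al_mean <- c(10.34500, 10.15245, 9.80884, 10.57537, 11.36794, 11.63600, 12.18000)
n       <- c(20, 163, 1457, 2198, 880, 175, 5)

# Group sizes should sum to the 4898 observations in the dataset.
total_n <- sum(n)

# Weighted mean of the group means, to compare with the overall mean of 10.51.
overall_mean <- weighted.mean(al_mean, w = n)

total_n
round(overall_mean, 2)
```

The check also makes the imbalance visible: qualities 5 and 6 together account for most of the weight, so they dominate the overall mean.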

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The ggpairs analysis shows that the variable most strongly correlated with quality is alcohol (0.482). Exploratory analysis of these variables shows that high quality wines have a higher percentage of alcohol than medium and low quality wines.

Other interesting relationships are density - alcohol and density - residual sugar: density decreases as the concentration of alcohol increases, whereas density increases as residual sugar increases. The first relationship (density - alcohol) is the closer to linear, as the pattern in the graphs above shows.

We found a scale that relates sugar, acid and pH: the IRF scale. This scale has 4 levels (Dry, Medium Dry, Medium Sweet and Sweet). After the bivariate analysis, we found that the high quality wines in this dataset are dry or medium dry, while the medium quality wines also include medium sweet wines.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

We found a higher concentration of sulphates in medium quality wines than in high or low quality wines. This chemical property acts as an antimicrobial and antioxidant, so it seems that it needs to be present in a certain amount. We believe the concentration appears higher for medium quality wines because more wines in this dataset belong to that category.

What was the strongest relationship you found?

The strongest relationships found are density-alcohol and density-residual sugar.

Multivariate Plots Section

The bivariate analysis showed that there is a correlation between alcohol and residual sugar, so it is interesting to evaluate how these variables relate to quality. The graph above shows that the wines with the highest concentration of residual sugar are of medium quality. For high quality wines, the percentage of alcohol increases, but the concentration of residual sugar is more widespread.

One relationship that we found interesting in the bivariate analysis was the one between free.sulfur.dioxide/total.sulfur.dioxide and their impact on the quality rating. The plot above shows that the concentration of SO2 is higher for medium quality wines with a dry or medium dry taste on the IRF scale. So it seems that these compounds need to be present in a certain concentration, but they affect the quality rating if the concentration exceeds a certain amount or is not high enough.

Since alcohol and density are among the most strongly correlated variables, we found it interesting to evaluate how they affect the quality rating, taking the IRF scale into account. The graph above shows that most of the medium sweet wines of medium quality have higher density than high quality wines, which tend to be medium dry or dry. On the other hand, density behaves similarly for medium dry and dry wines of medium and high quality.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

We found that the scale variable created to measure the IRF scale helped us evaluate the relationship between my main features of interest, such as residual sugar, acid and pH. Adding this variable to plots of density, alcohol and the SO2-related variables, and separating them by quality, helped us understand the interactions between them and their impact on the quality rating. For instance, the multivariate analysis showed that high quality wines tend to be dry or medium dry, with a lower concentration of SO2 than medium quality wines and a higher percentage of alcohol than other types of wine.

Were there any interesting or surprising interactions between features?

We found it interesting that the concentration of SO2 is higher for medium quality wines than for high quality wines. This chemical property needs to be present in wine to prevent microbial growth and oxidation, and it also affects the nose and taste of the wine. So it would be interesting to evaluate this variable together with pH and residual sugar, which also affect taste, in a multivariate analysis.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

No, we didn’t create a model


Final Plots and Summary

Plot One

Description One

This plot shows the density plot of alcohol by the categorical quality variable. We found this plot interesting to show in this section because it represents the distribution of two of the most strongly correlated variables. High quality wines tend to have a higher % of alcohol by volume than other wines. The plot also shows a higher concentration of medium quality wines than of low quality wines at lower percentages of alcohol. However, the distributions of medium and low quality wines are very similar.

Plot Two

Description Two

This plot shows the relationship between residual sugar and alcohol for wines of different quality. We found this plot useful because it evaluates alcohol, the variable with the strongest relationship with quality, together with its relationship with residual sugar (which is correlated with alcohol) and their impact on quality. It also helps to visualize the distribution of the different ratings within each quality category. As mentioned above, residual sugar is more concentrated in medium quality wines, and low quality wines tend not to be sweet.

Plot Three

Description Three

Alcohol and density have one of the strongest correlations among the variables evaluated. The plot above shows that these variables are negatively correlated: as the percentage of alcohol increases, density decreases. It also shows that most medium and high quality wines fall in the medium dry and dry categories, and that density tends to be lower for dry wines than for other types of wine.


Reflection

There are 4898 observations in this dataset, described by chemical properties such as acidity, residual sugar, chlorides, density, pH, sulphates and alcohol, together with a quality rating. In the exploratory analysis we created new variables: a categorical variable for quality named quality_factor with 3 levels (Low, Medium and High), and further variables to measure the IRF scale, an industry scale that measures taste based on acid, sugar and pH. This scale was created by the International Riesling Foundation (http://drinkriesling.com/tasteprofile/thescale).

In the first exploration, the univariate analysis, we found that most of the wines in this dataset are of medium quality. We also examined the distribution of the main variables of interest.

For the bivariate analysis we used ggpairs to detect which variables have the strongest correlations and to evaluate their impact on the quality rating. It was a very useful tool for prioritising the variables to evaluate and their relationships to each other. We found that the most strongly correlated pairs were density-alcohol (-0.795) and density-residual sugar (0.838). Other relationships of interest were total.sulfur.dioxide vs free.sulfur.dioxide (0.597), alcohol vs quality (0.482), alcohol vs residual sugar (-0.461), total.sulfur.dioxide vs alcohol (-0.445) and pH vs fixed.acidity (-0.441).

We found the IRF scale really useful; it helped us understand the interaction between the most important variables evaluated (acid, residual sugar and pH). The exploratory analysis showed that high quality wines tend to have a higher percentage of alcohol and to be dry or medium dry, while medium quality wines tend to be sweeter, with a higher concentration of residual sugar, a lower percentage of alcohol and a higher concentration of SO2.

Throughout this exploratory analysis I struggled to find the best way to visualize the relationships between some variables and the best resolution to show them, especially for sulphates and the SO2-related variables. However, we found the relationship between free.sulfur.dioxide/total.sulfur.dioxide and quality really interesting: SO2 is necessary because it protects the wine against oxidation and affects its taste, but its concentration is lowest in high quality wines. So we believe that if the concentration exceeds certain limits, it can negatively affect the quality rating.

For future work we could add a model to predict the quality of wine samples, or create a model to predict the type of wine based on the IRF scale. We could also create a new dataset combining the white and red Vinho Verde wines and compare how the variables in the dataset differ for each type of wine.